
[Reporting][DataFiltering] Expand data padding to full simulation range#2867

Open
ew3361zh wants to merge 33 commits into dev from expand-data-fix-2

Conversation

@ew3361zh
Collaborator

The new default behavior for data padding is to pad data across the entire simulation length.

Context

Issue(s) closed by this pull request: closes #2725

What

  1. New default data padding behavior: pad to the full simulation length.
  2. New option, use_fill_value_before_start, to pad data prior to the first observed report of the data.

Why

Users were repeatedly confused when data was not padded as expected because the reporting frequency was not daily.

How

  • The default behavior of the expand_data_temporally() utility is now to pad the data for the full length of the simulation. This allows any reported data to more easily match the dataset length of variables reported daily.
  • Added an option to turn this behavior off: set the expand_data_to_observed_range filter option to true to pad only within the observed range.
  • Added an option to pad data before the first day the data is reported (limited to the start of the simulation), in case the first day this variable is reported is not the first day of the simulation. By default, this takes the first observed value and carries that same value to the front of the padded dataset.
  • The option of whether or not to use a prescribed fill value is still available for the start, gaps, and end of the dataset (each separately or together).
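The padding rules above can be sketched with a small, hypothetical helper. This is an illustration only, not the actual expand_data_temporally() implementation; the function name, signature, and option names here are simplified stand-ins for the real filter options:

```python
def expand_series(records, sim_start, sim_end, fill_value=None,
                  use_fill_before_start=False, use_fill_in_gaps=False,
                  use_fill_at_end=False):
    """Pad a {day: value} mapping to cover every day in [sim_start, sim_end].

    When a fill flag is False, the nearest observed value is carried instead
    of the fill value (first value backward, previous value forward).
    """
    days = sorted(records)
    first, last = days[0], days[-1]
    out = {}
    prev = None
    for day in range(sim_start, sim_end + 1):
        if day in records:
            prev = records[day]
            out[day] = prev
        elif day < first:
            # Before the first observation: fill, or carry the first value back.
            out[day] = fill_value if use_fill_before_start else records[first]
        elif day > last:
            # After the last observation: fill, or carry the last value forward.
            out[day] = fill_value if use_fill_at_end else records[last]
        else:
            # Gap between observations: fill, or carry the previous value.
            out[day] = fill_value if use_fill_in_gaps else prev
    return out
```

For example, with observations only on days 2 and 4 of a 7-day simulation, carry-forward padding repeats the neighboring values, while enabling use_fill_in_gaps substitutes the fill value only inside the gap.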

Test plan

I made an exhaustive set of filters to exercise the behavior:

{
  "multiple": [
  {
    "name": "Expand to Observed (Info Maps) Range - no fill value",
    "filters": [
      "Feed.Silage.Bunker.corn_silage_storage_1.total_dry_matter_mass"
    ],
    "expand_data": true,
    "expand_data_to_observed_range": true,
    "use_fill_value_before_start": false,
    "use_fill_value_in_gaps": false,
    "use_fill_value_at_end": false
  },
  {
    "name": "Expand full simulation - fill everywhere",
    "filters": [
      "Feed.Silage.Bunker.corn_silage_storage_1.total_dry_matter_mass"
    ],
    "expand_data": true,
    "fill_value": "FILL",
    "use_fill_value_before_start": true,
    "use_fill_value_in_gaps": true,
    "use_fill_value_at_end": true
  },
  {
    "name": "Expand full simulation - fill everywhere - no fill value provided",
    "filters": [
      "Feed.Silage.Bunker.corn_silage_storage_1.total_dry_matter_mass"
    ],
    "expand_data": true,
    "use_fill_value_before_start": true,
    "use_fill_value_in_gaps": true,
    "use_fill_value_at_end": true
  },
  {
    "name": "Expand full simulation - use carried values everywhere",
    "filters": [
      "Feed.Silage.Bunker.corn_silage_storage_1.total_dry_matter_mass"
    ],
    "expand_data": true,
    "use_fill_value_before_start": false,
    "use_fill_value_in_gaps": false,
    "use_fill_value_at_end": false
  },
  {
    "name": "Expand full simulation - fill start only",
    "filters": [
      "Feed.Silage.Bunker.corn_silage_storage_1.total_dry_matter_mass"
    ],
    "expand_data": true,
    "fill_value": "FILL",
    "use_fill_value_before_start": true,
    "use_fill_value_in_gaps": false,
    "use_fill_value_at_end": false
  },
  {
    "name": "Expand full simulation - fill gaps only",
    "filters": [
      "Feed.Silage.Bunker.corn_silage_storage_1.total_dry_matter_mass"
    ],
    "expand_data": true,
    "fill_value": "FILL",
    "use_fill_value_before_start": false,
    "use_fill_value_in_gaps": true,
    "use_fill_value_at_end": false
  },
  {
    "name": "Expand full simulation - fill end only",
    "filters": [
      "Feed.Silage.Bunker.corn_silage_storage_1.total_dry_matter_mass"
    ],
    "expand_data": true,
    "fill_value": "FILL",
    "use_fill_value_before_start": false,
    "use_fill_value_in_gaps": false,
    "use_fill_value_at_end": true
  },
  {
    "name": "No expansion",
    "filters": [
      "Feed.Silage.Bunker.corn_silage_storage_1.total_dry_matter_mass"
    ]
  }
  ]
}

Input Changes

Output Changes

  • N/A


@github-actions
Contributor

Current Coverage: 99%

Mypy errors on expand-data-fix-2 branch: 1213
Mypy errors on dev branch: 1213
No difference in error counts

@ew3361zh ew3361zh requested a review from allisterakun March 18, 2026 13:05
Collaborator

@allisterakun left a comment

LGTM! Thank you Niko! Great improvements!

Collaborator

@morrowcj left a comment

I would like to advocate that pandas.NA be used as the default "fill_value" instead of numpy.NaN. From a data science perspective NaN (not a number) and NA (missing value) are meaningfully different: the former indicates that an operation produces a nonsensical result (e.g., division by zero) but the latter indicates that a value was not observed or measured. In this case, the fill value is meant to indicate that the variable in question was not measured at this time step, so NA is the more appropriate value.

Also, I think that the default behavior for non-aggregated data should be to fill gaps for each time step in the model. My reasoning for this is that it retains the full time series, which contains information about when variables were measured and when they weren't. It also ensures that multiple variables line up, which is crucial for any data analysis or comparison. Ultimately, I think that this behavior should be built into the output manager, but padding the data by default in the interim is a fine bridge. I'm happy to discuss these points in more detail at any time.


edit: I suppose an empty cell ("") would also be acceptable in the CSV output instead of pandas.NA; that is typically how Excel denotes missing values. But explicit would be better, IMO.
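For reference, the NaN/NA distinction can be seen directly in pandas. This is a minimal sketch with standard pandas behavior; nothing here is specific to this codebase:

```python
import numpy as np
import pandas as pd

# NaN is a float value ("not a number"); pd.NA is pandas' dedicated
# missing-value marker that works across dtypes.
s_nan = pd.Series([1.0, np.nan, 3.0])            # NaN forces float64
s_na = pd.Series([1, pd.NA, 3], dtype="Int64")   # NA keeps a nullable int dtype

# Both are treated as missing by pandas:
assert s_nan.isna().tolist() == [False, True, False]
assert s_na.isna().tolist() == [False, True, False]

# But only NaN coerces the column to float:
assert str(s_nan.dtype) == "float64"
assert str(s_na.dtype) == "Int64"
```

When written with to_csv, both render as empty cells by default, which matches the empty-cell convention mentioned above; the na_rep parameter can make the marker explicit in the output.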

@JoeWaddell
Collaborator

To also update from conversations yesterday: in testing this, I found that the use of regex and dictionaries caused this functionality to break and stop the simulation. @ew3361zh is investigating further!

@ew3361zh
Collaborator Author

ew3361zh commented Mar 24, 2026

To also update from conversations yesterday: in testing this I found the use of regex and dictionaries caused this functionality to break and stop the simulation. @ew3361zh is investigating further!

Whew, after some investigation, I think the culprit is indeed the filtering process finding a variable that was added to OM as a dictionary. For example, in AnimalModuleReporter with nutrient_amounts:

nutrient_amounts = {
    "dm": average_nutrition_supply.dry_matter,
    "CP": average_nutrition_supply.crude_protein,
    "ADF": average_nutrition_supply.adf_supply,
    "NDF": average_nutrition_supply.ndf_supply,
    "lignin": average_nutrition_supply.lignin_supply,
    "ash": average_nutrition_supply.ash_supply,
    "phosphorus": average_nutrition_supply.phosphorus * GeneralConstants.GRAMS_TO_KG,
    "potassium": average_nutrition_supply.potassium_supply,
    "N": average_nutrition_supply.nitrogen_supply,
    "as_fed": average_nutrition_supply.wet_matter,
    "EE": average_nutrition_supply.fat_supply,
    "starch": average_nutrition_supply.starch_supply,
    "TDN": average_nutrition_supply.tdn_supply,
    "DE": average_nutrition_supply.digestible_energy_supply,
    "calcium": average_nutrition_supply.calcium * GeneralConstants.GRAMS_TO_KG,
    "fat": average_nutrition_supply.fat_supply * GeneralConstants.KG_TO_GRAMS,
    "fat_percentage": average_nutrition_supply.fat_percentage,
    "forage_ndf": average_nutrition_supply.forage_ndf_supply,
    "forage_ndf_percent": average_nutrition_supply.forage_ndf_percentage,
    "ME": average_nutrition_supply.metabolizable_energy,
    "NE_maintenance_and_activity": average_nutrition_supply.maintenance_energy,
    "NE_lactation": average_nutrition_supply.lactation_energy,
    "NE_growth": average_nutrition_supply.growth_energy,
    "metabolizable_protein": average_nutrition_supply.metabolizable_protein,
}
om.add_variable(f"ration_nutrient_amount_for_{pen_base_name}", nutrient_amounts, info_map)

The padding process uses the fill value as-is, which then produces an issue when trying to save this to a CSV.

So take this filter, which works on dev:

{
    "name": "Farmgrown Feed DMI total, kg",
    "filters": [
        "FeedManager._log_feed_deductions.farmgrown_feed_.*_fed"
    ],
    "expand_data": true,
    "use_fill_value_in_gaps": true,
    "use_fill_value_at_end": true,
    "fill_value": 0
}

The reason it works on dev but NOT on the feature branch is that on dev no filling happens AFTER the last piece of data reported to OM, whereas on the feature branch it fills in one 0 on the final day of the simulation (which may be related to a separate issue Kristan made about the reporting of data being a day off, or to the adjustment we made to RufasTime in this PR).

Essentially on the feature branch we have a list of dictionaries for each value in that parent variable dict that look like: [{"simulation_day": 0, "amount": 0.0}, {"simulation_day": 1, "amount": 0.0}...{"simulation_day": x, "amount": x}, 0]

The fill value set in the filter is 0, and the padding process pads with the exact fill_value only; it does nothing to accommodate the original data format.

So right now dev is not properly handling dictionary padding either.
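The mismatch can be reduced to a few lines. This is a hypothetical illustration, not code from the PR, and make_pad_entry is not an existing helper — just one possible structure-aware fix:

```python
# The reported series is a list of dicts, but the padder appends the
# scalar fill_value as-is, producing a mixed-type list that breaks
# CSV serialization.
reported = [
    {"simulation_day": 0, "amount": 0.0},
    {"simulation_day": 1, "amount": 1.5},
]

# Current (broken) behavior: pad with the raw fill value.
padded_broken = reported + [0]  # mixed dict/int entries

# One possible structure-aware fix: build the pad entry to match the
# shape of the observed records (field names here are illustrative).
def make_pad_entry(template: dict, day: int, fill_value) -> dict:
    entry = {key: fill_value for key in template}
    entry["simulation_day"] = day
    return entry

padded_fixed = reported + [make_pad_entry(reported[-1], 2, 0)]
```

Under this sketch, the padded entry keeps the same keys as the observed records, so downstream CSV serialization sees a uniform list of dicts.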

To me, this feels like it needs a separate issue rather than trying to tackle it here. I can make it a higher priority to make sure it gets fixed quickly. Let me know if that sounds OK, reviewers @JoeWaddell @allisterakun @morrowcj

@ew3361zh
Collaborator Author

I would like to advocate that pandas.NA be used as the default "fill_value" instead of numpy.NaN. From a data science perspective NaN (not a number) and NA (missing value) are meaningfully different: the former indicates that an operation produces a nonsensical result (e.g., division by zero) but the latter indicates that a value was not observed or measured. In this case, the fill value is meant to indicate that the variable in question was not measured at this time step, so NA is the more appropriate value.

This makes sense to me. I can implement it here.

Also, I think that the default behavior for non-aggregated data should be to fill gaps for each time step in the model. My reasoning for this is that it retains the full time series, which contains information about when variables were measured and when they weren't. It also ensures that multiple variables line up, which is crucial for any data analysis or comparison. Ultimately, I think that this behavior should be built into the output manager, but padding the data by default in the interim is a fine bridge. I'm happy to discuss these points in more detail at any time.

So you're saying the default behavior should be expand_data = True? I agree with this as well but would love other input from @JoeWaddell and @allisterakun

@morrowcj
Collaborator

morrowcj commented Mar 24, 2026

This to me feels like a separate issue is needed for it rather than trying to tackle it here. I can make it a higher priority to make sure it gets fixed quickly. Let me know if that sounds ok reviewers

That sounds OK to me.

So you're saying the default behavior should be expand_data = True?

Yes - at least for all report files that contain any raw (non-aggregated) or daily data. Though, I too am interested in the others' perspectives.

@github-actions
Contributor

Current Coverage: 99%

Mypy errors on expand-data-fix-2 branch: 1215
Mypy errors on dev branch: 1215
No difference in error counts

@github-actions
Contributor

Current Coverage: 99%

Mypy errors on expand-data-fix-2 branch: 1212
Mypy errors on dev branch: 1212
No difference in error counts

@JoeWaddell
Collaborator

Yes - at least for all report files that contain any raw (non-aggregated) or daily data. Though, I too am interested in the others' perspectives.

I think across the board the default assumption is that output variables are reported on a daily basis and only once unless otherwise stated (see all the ration interval reporting that currently happens).

I'll need to think a bit more about how we'd want to handle those cases, and what other cases might be exceptions to the general rule of one output per day.

@ew3361zh
Collaborator Author

Yes - at least for all report files that contain any raw (non-aggregated) or daily data. Though, I too am interested in the others' perspectives.

I think across the board the default assumption is that output variables are reported on a daily basis and only once unless otherwise stated (see all the ration interval reporting that currently happens).

I'll need to think a bit more about how we'd want to handle those cases, and what other cases might be exceptions to the general rule of one output per day.

We discussed this a little at the dev team wt today, and I think there's some concern that users won't realize padding has occurred if it is the default behavior. That could lead to misinterpretation of the data produced, depending on how it looks. So if we want to go this route, it has to be fully understandable to new users (and to experienced users accustomed to a certain behavior). Would NaNs as padding be intuitive enough? I'm not sure.



Development

Successfully merging this pull request may close these issues.

expand_data_temporally doesn't expand values to end of simulation

4 participants